Skip to content

Conversation

@rescrv
Copy link
Contributor

@rescrv rescrv commented Oct 30, 2025

Description of changes

We need to put both tokens in both numeric and string form into the
outputs for statistics functions.

Test plan

CI

Migration plan

N/A

Observability plan

N/A

Documentation Changes

N/A

@github-actions
Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@propel-code-bot
Copy link
Contributor

propel-code-bot bot commented Oct 30, 2025

Add string label alongside index for SparseVector statistics

Adds optional string labels to sparse-vector statistics so that each token is emitted in both numeric (index) and human-readable (label) form. This required extending StatisticsValue::SparseVector to carry an Option<String>, updating all match arms, hash/eq implementations, display helpers, record-building logic, deserialisation in load_existing_statistics, and metadata emitted by the statistics executor.

Key Changes

• Changed StatisticsValue::SparseVector from SparseVector(u32) to SparseVector(u32, Option<String>)
• Added helper methods stable_value_index(), stable_value_label(), stable_value_string() and updated type_prefix()/stable_type()
• Updated from_metadata_value, equality, hashing, and display impls to handle new tuple form
• Emit new metadata field value_label when label present; reader path now parses it to remain backward-compatible
• Refactored record-emission to use stable_value_index/stable_value_string and include value_label when present
• Adjusted tests and parser to support optional labels for legacy statistics

Affected Areas

rust/worker/src/execution/functions/statistics.rs (all statistics generation and parsing logic)
• Metadata schema for statistics records (value_label field added)

This summary was automatically generated by @propel-code-bot

@blacksmith-sh

This comment has been minimized.

@rescrv rescrv force-pushed the rescrv/use-sparse-vector-tokens branch from a2ec476 to a0b964a Compare October 31, 2025 19:44
@rescrv rescrv requested a review from tanujnay112 October 31, 2025 20:05
Copy link
Contributor

@tanujnay112 tanujnay112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming should be improved

@rescrv rescrv force-pushed the rescrv/sparse branch 2 times, most recently from beb0dbc to 1279718 Compare November 22, 2025 04:03
Comment on lines 101 to 114
@@ -156,7 +113,7 @@ impl SparseVector {
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

The implementation of from_pairs can be simplified by using unzip, which is more idiomatic and likely more efficient. I've also renamed the parameter from triples to pairs for clarity.

Suggested change
pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self {
let (indices, values) = pairs.into_iter().unzip();
Self {
indices,
values,
tokens: None,
}
}
Context for Agents
The implementation of `from_pairs` can be simplified by using `unzip`, which is more idiomatic and likely more efficient. I've also renamed the parameter from `triples` to `pairs` for clarity.

```suggestion
    pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self {
        let (indices, values) = pairs.into_iter().unzip();
        Self {
            indices,
            values,
            tokens: None,
        }
    }
```

File: rust/types/src/metadata.rs
Line: 114

@rescrv rescrv force-pushed the rescrv/use-sparse-vector-tokens branch from 7865879 to ce0a380 Compare November 22, 2025 18:18
Comment on lines 250 to 256
if let Some(stable_value_token) = stats_value.stable_value_token() {
metadata.insert(
"value_token".to_string(),
UpdateMetadataValue::Str(stable_value_token),
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[TestCoverage]

This new logic to add value_token is a great addition. However, it doesn't appear to be covered by tests. The existing tests for sparse vectors seem to only cover cases where tokens are not provided, exercising the None path for stable_value_token.

To ensure this new functionality is robust, could you please add a test case with a SparseVector that includes tokens? This test should assert that the value_token field is correctly populated in the resulting statistics records. This would likely require updating test helpers like extract_metadata_tuple as well.

Context for Agents
This new logic to add `value_token` is a great addition. However, it doesn't appear to be covered by tests. The existing tests for sparse vectors seem to only cover cases where tokens are not provided, exercising the `None` path for `stable_value_token`.

To ensure this new functionality is robust, could you please add a test case with a `SparseVector` that includes tokens? This test should assert that the `value_token` field is correctly populated in the resulting statistics records. This would likely require updating test helpers like `extract_metadata_tuple` as well.

File: rust/worker/src/execution/functions/statistics.rs
Line: 255

@rescrv rescrv force-pushed the rescrv/use-sparse-vector-tokens branch from ce0a380 to 36086d2 Compare November 22, 2025 21:31
Comment on lines 603 to 604
// Wrap in Arc to avoid cloning large MaterializeLogOutput data
let log_fetch_records_clone = log_fetch_records.clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

Resource leak in concurrent compaction handling: The clone() operations at lines 603-604 create deep copies of log_fetch_records and entire CompactionContext for parallel execution. With large datasets (e.g., 10k records), this duplicates significant memory without cleanup guarantees if one future fails.

// Current: Clones entire state
let log_fetch_records_clone = log_fetch_records.clone();
let mut self_clone_fn = self.clone();

// Safer approach: Use Arc to share data
let log_fetch_records = Arc::new(log_fetch_records);
let fn_records = Arc::clone(&log_fetch_records);
Context for Agents
**Resource leak in concurrent compaction handling**: The `clone()` operations at lines 603-604 create deep copies of `log_fetch_records` and entire `CompactionContext` for parallel execution. With large datasets (e.g., 10k records), this duplicates significant memory without cleanup guarantees if one future fails.

```rust
// Current: Clones entire state
let log_fetch_records_clone = log_fetch_records.clone();
let mut self_clone_fn = self.clone();

// Safer approach: Use Arc to share data
let log_fetch_records = Arc::new(log_fetch_records);
let fn_records = Arc::clone(&log_fetch_records);
```

File: rust/worker/src/execution/orchestration/compact.rs
Line: 604

.run_compaction(collection_id, system)
.await;
let result = Box::pin(compaction_context.run_compaction(collection_id, system)).await;
Box::pin(compaction_context.cleanup()).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

Potential panic from unbounded recursion: Box::pin() at line 723 wraps an async function that can recursively call itself through orchestrator chains. With deep call stacks (e.g., multiple attached functions), this risks stack overflow despite the 8MB thread stack allocation in tests.

// Add depth tracking to prevent unbounded recursion
pub async fn compact(
    // ... params
    max_depth: Option<u32>,
) -> Result<CompactionResponse, CompactionError> {
    if let Some(depth) = max_depth {
        if depth == 0 {
            return Err(CompactionError::InvariantViolation(
                "Maximum orchestration depth exceeded".to_string()
            ));
        }
    }
    // ... rest of function
}
Context for Agents
**Potential panic from unbounded recursion**: `Box::pin()` at line 723 wraps an async function that can recursively call itself through orchestrator chains. With deep call stacks (e.g., multiple attached functions), this risks stack overflow despite the 8MB thread stack allocation in tests.

```rust
// Add depth tracking to prevent unbounded recursion
pub async fn compact(
    // ... params
    max_depth: Option<u32>,
) -> Result<CompactionResponse, CompactionError> {
    if let Some(depth) = max_depth {
        if depth == 0 {
            return Err(CompactionError::InvariantViolation(
                "Maximum orchestration depth exceeded".to_string()
            ));
        }
    }
    // ... rest of function
}
```

File: rust/worker/src/execution/orchestration/compact.rs
Line: 723

Comment on lines 666 to 671
for hnsw_index_uuid in self.hnsw_index_uuids {
let _ = HnswIndexProvider::purge_one_id(
self.hnsw_provider.temporary_storage_path.as_path(),
hnsw_index_uuid,
)
.await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

Missing error propagation in cleanup path: cleanup() at line 666 uses let _ = to ignore HNSW purge errors. If cleanup fails during error handling (e.g., disk full), subsequent compactions will leak resources without visibility.

pub(crate) async fn cleanup(self) {
    let mut cleanup_errors = Vec::new();
    for hnsw_index_uuid in self.hnsw_index_uuids {
        if let Err(e) = HnswIndexProvider::purge_one_id(
            self.hnsw_provider.temporary_storage_path.as_path(),
            hnsw_index_uuid,
        ).await {
            cleanup_errors.push((hnsw_index_uuid, e));
        }
    }
    if !cleanup_errors.is_empty() {
        tracing::warn!("Cleanup failures: {:?}", cleanup_errors);
    }
}
Context for Agents
**Missing error propagation in cleanup path**: `cleanup()` at line 666 uses `let _ =` to ignore HNSW purge errors. If cleanup fails during error handling (e.g., disk full), subsequent compactions will leak resources without visibility.

```rust
pub(crate) async fn cleanup(self) {
    let mut cleanup_errors = Vec::new();
    for hnsw_index_uuid in self.hnsw_index_uuids {
        if let Err(e) = HnswIndexProvider::purge_one_id(
            self.hnsw_provider.temporary_storage_path.as_path(),
            hnsw_index_uuid,
        ).await {
            cleanup_errors.push((hnsw_index_uuid, e));
        }
    }
    if !cleanup_errors.is_empty() {
        tracing::warn!("Cleanup failures: {:?}", cleanup_errors);
    }
}
```

File: rust/worker/src/execution/orchestration/compact.rs
Line: 671

@rescrv rescrv changed the base branch from rescrv/sparse to main November 23, 2025 02:03
Comment on lines +173 to +175
(Self::SparseVector(lhs1, lhs2), Self::SparseVector(rhs1, rhs2)) => {
lhs1 == rhs1 && lhs2 == rhs2
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

For consistency with the other match arms in this PartialEq implementation, you can remove the braces and use a single expression with a trailing comma.

Context for Agents
For consistency with the other match arms in this `PartialEq` implementation, you can remove the braces and use a single expression with a trailing comma.

File: rust/worker/src/execution/functions/statistics.rs
Line: 175

StatisticsValue::Float(value) => value.to_bits().hash(state),
StatisticsValue::Str(value) => value.hash(state),
StatisticsValue::SparseVector(value) => value.hash(state),
StatisticsValue::SparseVector(value, token) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

token -> label

@rescrv rescrv merged commit 2d85922 into main Nov 23, 2025
61 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants